ABSTRACT

A general-purpose scene-analysis system is
described which uses constraint-filtering
techniques to apply domain knowledge in the
interpretation of the regions extracted from a
segmented image. An example is given of the
configuration of the system for a particular domain,
FLIR (Forward Looking InfraRed) images, as well as
results of the system's performance on some typical
images from this domain.

1.  Introduction 

An image (whether on the human retina, on
photographic film, or in some electronic device)
is formed by a complicated interaction of light
with objects in three-dimensional space (a scene).
Scene analysis is the process of unravelling this
interaction: inferring from an image the
arrangement of lighting and objects that produced it.
In theory, this problem is indeterminate: a given
image may result from many different scenes, all
of which happen to appear identical from the
observer's viewpoint. But in practice there are
usually sufficient restrictions on allowable
scenes to permit essentially only one
interpretation of the image. The problem is to find
this interpretation efficiently. Humans are clearly
able to do this. Can computers achieve similar
performance?

In this paper we present a method for scene
analysis based on the application of
constraint-filtering techniques to a network of regions
extracted from an image. Such an approach has two
chief advantages. The first is its conceptual
simplicity: it provides a clean separation between the
general processing algorithm and the knowledge
about a particular domain, which is expressed
declaratively as constraints. The second is its
computational speed: constraint filtering can be
decomposed into many almost independent processes
which can all be run in parallel on a suitable
multiprocessor computer.

To  try  this  approach,  we  have  implemented  a 
prototype system that does scene analysis by
constraint filtering. A diagram giving an informal
overview  of  the  system  is  shown  in  Figure  1.  In 
the  interest  of  expediency  we  have  made  many  sim- 
plifications. For  example,  only  a few  crude 
measurements  are  made  on  the  extracted  regions.  In 
the  following  sections,  we  will  describe  the  proto- 
type system,  taking  note  of  these  simplifications. 

We will also show how the system is used in a
particular domain: forward-looking infrared (FLIR)
images of battlefield scenes. This domain was
chosen partly because of its military interest, but
primarily because its moderate complexity is just
about right for fully exercising the prototype
system. Then we will discuss the system's
performance, taking care to distinguish those failures
that are inherent in the method from those that are
merely the result of simplifications made in this
implementation. Finally, we will suggest
directions for further progress.

2.  Segmentation 

A digital  image  is  merely  an  array  of  light 
intensity  (or  color)  values.  There  seems  to  be  no 
way  of  going  directly  from  these  values  to  a 
description  of  a scene  in  terms  of  the  objects  in 
it. As argued by Barrow and Tenenbaum [1,2],
Marr [3], and numerous others, several stages of
processing are needed, each with its own
intermediate representations of the information contained
in  the  image.  A first  step  is  to  organize  the 
pixels  into  groupings  that  correspond  more  closely 
to  the  objects  in  the  scene. 

Typically,  this  is  done  by  segmenting  the  image 
into  regions  of  fairly  homogeneous  brightness.  For 
many  scenes,  this  is  a reasonable  thing  to  do.  In 
most  cases,  the  regions  will  correspond  to  the  ob- 
jects themselves,  or  else  to  significant  pieces  of 
them.  By  this  means  the  myriads  of  pixels  in  an 
image  can  be  reduced  to  a few  score,  or  a few 
hundred  regions,  considerably  decreasing  the  amount 
of data that must be processed, but with little loss
of  information.  Furthermore,  since  regions  more 
closely  correspond  to  objects,  expectations  about 
the  appearance  of  objects  can  be  more  readily 
applied  to  the  regions  than  to  unorganized  pixels. 
However, segmentation into regions is not
without  problems.  An  initial  process  of  segmenta- 
tion normally  applies  a single  criterion  of  homo- 
geneity over  the  entire  image.  Unfortunately,  a 
difference  in  brightness  that  is  insignificant  in 
some  contexts  (such  as  a fluctuation  in  a textured 
background)  may  well  be  very  significant  in  other 
contexts  (such  as  part  of  the  border  of  an  object 
with  its  surroundings).  Most  segmentation  pro- 
cesses take  little  account  of  this  sort  of  con- 
textual information,  and  so  make  errors  of  two 
sorts: oversegmentation and undersegmentation.

Oversegmentation  breaks  into  pieces  what  should 
ideally  be  a single  region.  Properties  of  the 
region  as  a whole  (such  as  shape  and  area)  and 
relations  with  other  regions  (such  as  adjacency 
and  surroundedness)  cannot  be  computed  directly, 
but  can  only  be  recovered  by  attempting  to  merge 
pieces  that  are  likely  to  belong  to  the  same 
region. More serious is undersegmentation, by
which  several  regions  that  should  be  distinct  are 
fused  together.  Again,  region  properties  and 
relations  are  lost,  but  recovering  them  is  a more 
difficult  business  of  attempting  to  split  the 
fused  region  into  parts. 

Several attempts have been made to overcome
this problem. Tenenbaum and Barrow in IGS
(Interpretation Guided Segmentation) [4] used
domain knowledge to guide the low-level segmentation.
Constraints about the relationships between
objects were used to guide the merging of pixels into
regions. Feldman and Yakimovsky [5] also used
semantic constraints to guide segmentation.
Another approach, used by Nagao and Matsuyama [6],
first performs an unguided segmentation and later
corrects the errors in this segmentation by a
semantically controlled process of merging and
splitting regions. We assume that undersegmentation
never occurs, and that oversegmentation is not
serious: that an object is at worst broken into
two or three pieces. We augment our domain model
to cover fragments of objects, but without making
any attempt to integrate them into wholes. For the
simple domain used as an example, the initial
segmentation can usually be fine-tuned by hand to fit
our assumptions above. Even so, failures are not
uncommon, indicating that a more subtle treatment
of segmentation errors is needed.

First  we  smooth  the  image  using  an  edge- 
preserving smoothing  technique  in  order  to  reduce 
noise.  The  particular  technique  used  does  not 
matter  greatly,  but  usually  we  have  used  Narayanan 
and  Rosenfeld’s  histogram-guided  smoothing  tech- 
nique [7],  which  has  proved  quite  effective.  Next, 
we  requantize  the  image  into  a small  number  of 
gray  levels  (typically  five),  following  the  peak 
structure  of  the  histogram  of  the  smoothed  image. 

After  this,  the  regions  themselves  can  be  ex- 
tracted by  a connected  components  analysis.  At  the 
same  time,  we  make  a few  measurements  on  each 
region;  these  measurements  serve  as  a description 
of  the  region  for  all  subsequent  processing.  We 
construct  the  bounding  upright  rectangle  around 
each  region  (see  Figure  2)  and  measure  the  image 
location  of  its  lower  left  corner,  its  width  and 


height,  as  well  as  the  area  and  average  brightness 
of  the  region  itsei . 

As  mentioned  above,  these  measurements  provide 
only  a crude  description  of  each  region,  but  suffi- 
cient for  this  prototype  system.  A full  implemen- 
tation would  need  a more  complete  description  of 
shape,  perhaps  the  chain  code  of  the  boundary  of 
each  region.  Since  any  region  description  is 
necessarily  incomplete,  it  may  ultimately  be  neces- 
sary to  refer  back  to  the  original  image  to  check 
for  properties  that  cannot  easily  or  efficiently  be 
extracted  by  preprocessing  operations. 
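
The region extraction and measurement step described in this section can be sketched as follows. This is only an illustration in Python, not the authors' implementation; the function name `extract_regions`, the dictionary keys, and the use of 4-connectivity are assumptions made for the example. It takes the requantized image (to decide connectivity) and the smoothed image (to compute average brightness).

```python
from collections import deque

def extract_regions(quantized, gray):
    """Find 4-connected regions of equal quantized level and measure each.

    quantized and gray are equal-sized lists of rows; row 0 is the top of
    the image, so the "lower" edge of a box is its largest row index.
    """
    rows, cols = len(quantized), len(quantized[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0]:
                continue
            level = quantized[r0][c0]
            seen[r0][c0] = True
            queue = deque([(r0, c0)])
            pixels = []
            while queue:  # breadth-first flood fill of one region
                r, c = queue.popleft()
                pixels.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc]
                            and quantized[nr][nc] == level):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            rs = [p[0] for p in pixels]
            cs = [p[1] for p in pixels]
            regions.append({
                "left": min(cs),                    # lower left corner, column
                "bottom": max(rs),                  # lower left corner, row
                "width": max(cs) - min(cs) + 1,     # bounding upright rectangle
                "height": max(rs) - min(rs) + 1,
                "area": len(pixels),
                "brightness": sum(gray[r][c] for r, c in pixels) / len(pixels),
            })
    return regions
```

The six measurements per region are exactly the crude description named in the text; a fuller description of shape would need additional fields.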

3.  Constraint  filtering 

After  segmentation,  scene  analysis  becomes 
mostly  a matter  of  labelling  the  regions  with  their 
identifications  as  objects  or  object  parts.  (For 
now  we  ignore  the  problem  of  organizing  the  parts 
of  objects  into  wholes.)  Clearly,  only  those 
labellings  are  valid  that  can  be  derived  from  an 
arrangement of real objects in space. Properties
of  objects,  and  relationships  between  them,  imply 
corresponding  properties  and  relationships  of  the 
image  regions  that  result  from  these  objects. 

These  projected  properties  and  relationships  con- 
strain the  possible  labelling  of  regions  with 
object  identifications.  Thus  scene  analysis  can 
be  reduced  to  a constraint  satisfaction  problem. 

(The early work of Barrow et al. [8,9] used this
approach,  with  the  constraints  derived  from  a re- 
lational structure  which  provided  a single  but 
inflexible  scene  model.) 

The  traditional  technique  for  solving  such 
problems  is  backtracking.  However,  backtracking  is 
inherently  a sequential  technique,  which  does  not 
lend  itself  well  to  parallel  processing.  Even  if 
we  restrict  our  attention  to  sequential  processing, 
simple backtracking has a serious defect,
especially on the problems arising in scene analysis:
it suffers greatly from "thrashing" behavior [10,
11,12]. When a failure is discovered, only the most
recent labelling is reconsidered. If the true
cause of the failure lies in an earlier labelling,
it will take the program many steps of blind
backtracking before it can undo the incorrect labelling.
To overcome these problems, a number of authors
[10,11,14,15,16,17,18,21] have proposed "constraint
filtering", "relational consistency", or "discrete
relaxation" techniques for constraint satisfaction
problems.  Some  have  emphasized  the  suitability  of 
these  methods  for  parallel  processing,  while  others 
have  stressed  the  avoidance  of  thrashing.  We  feel 
that  the  chief  advantage  of  these  methods  lies  in 
their  potential  parallelism,  especially  since 
Gaschnig  [13]  has  shown  that  more  sophisticated 
backtracking  methods  can  outperform  sequential 
implementations  of  constraint  filtering. 

In  order  to  perform  constraint  filtering,  it 
is  necessary  that  those  nodes  (regions)  that  con- 
strain each  other  be  connected  in  a network.  It  is 
at least theoretically possible for the labelling
of  a region  to  be  influenced  by  any  other  region  in 
the  image,  so  ideally  the  constraint  network  should 
be  a complete  graph,  connecting  each  region  to  every 


other  region.  But  in  practice  this  is  not  feasi- 
ble, since  the  number  of  interconnections  grows  as 
the  square  of  the  number  of  regions  — far  too  fast. 
Having  too  many  interconnections  is  undesirable  for 
two reasons: First, it increases the cost of
building a hardware network for constraint filtering
(though  some  architectures,  such  as  that  of  ZMOB 
[19],  permit  arbitrary  interconnection  at  no  extra 
hardware  cost).  Second,  and  more  important,  it 
increases  processing  time,  since  the  amount  of 
computation  done  for  each  region  is  roughly  pro- 
portional to  the  number  of  regions  it  is  connected 
to. 

Therefore,  it  is  desirable  to  limit  the  num- 
ber of  interconnections.  This  must  be  done  care- 
fully, since  the  correctness  and  effectiveness  of 
the  constraint  filtering  depend  on  the  complete- 
ness of  the  interconnections,  and  the  lack  of  a 
necessary  connection  might  prevent  or  mislead  the 
application  of  an  important  constraint.  We  would 
like  to  build  a network  of  interconnections  that 
is  as  sparse  as  possible,  but  still  produces  the 
same  results  as  the  complete  graph.  Obviously, 
this  cannot  be  known  ahead  of  time— the  best  we 
can  do  is  to  connect  those  regions  that  have  a 
good  chance  of  being  relevant  to  each  other. 

This  is  a matter  that  requires  much  further  inves- 
tigation,  but  for  now  we  have  implemented  a simple 
notion  of  relevance:  A region  is  connected  to  all 

regions  that  are  very  close  to  it  (because  these 
make  up  its  immediate  context),  and  to  all  very 
large  regions  in  the  image  (because  these  give  a 
good  basis  for  judging  it  in  its  global  context). 

Thus  the  number  of  interconnections  is  roughly 
constant  for  each  node  and  overall  is  proportional 
to  the  number  of  regions  in  the  image.  This  inter- 
connection scheme  is  imperfect,  but  appears  to  work 
with  few  errors,  at  least  for  the  domain  of  FLIR 
images used in this report.
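
The relevance rule for configuring the network can be sketched as below. The sketch is hypothetical: it assumes that "very close" means that the bounding boxes of two regions overlap or are immediately adjacent, and that "very large" means the `n_large` regions of greatest area. The names `boxes_touch` and `build_network` and the box format `(x0, y0, x1, y1)` with inclusive pixel coordinates are invented for the example.

```python
def boxes_touch(a, b):
    """True if two inclusive boxes (x0, y0, x1, y1) overlap or are adjacent."""
    return (a[0] <= b[2] + 1 and b[0] <= a[2] + 1 and
            a[1] <= b[3] + 1 and b[1] <= a[3] + 1)

def build_network(boxes, areas, n_large=2):
    """Return the set of arcs (i, j), i < j, chosen by the relevance rule."""
    # The n_large regions of greatest area are connected to every region.
    order = sorted(range(len(boxes)), key=lambda i: areas[i], reverse=True)
    large = set(order[:n_large])
    arcs = set()
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if i in large or j in large or boxes_touch(boxes[i], boxes[j]):
                arcs.add((i, j))
    return arcs

# Two adjacent small regions, one isolated small region, one very large region.
boxes = [(0, 0, 4, 4), (5, 0, 9, 4), (20, 20, 21, 21), (0, 10, 30, 40)]
areas = [25, 25, 4, 900]
arcs = build_network(boxes, areas, n_large=1)
```

This construction still examines every pair of regions once, but the resulting network is sparse, so the cost of each later filtering pass grows roughly linearly with the number of regions.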

Once  the  configuration  of  the  network  is  com- 
plete, the  constraint  filtering  proper  can  begin: 

We  first  of  all  attach  to  each  region  a list  con- 
taining all  the  labellings  that  it  might  possibly 
bear.  (Currently,  these  will  be  all  labellings 
possible  in  the  domain,  although  it  should  be 
possible  to  use  context  and  the  taxonomy  of  labels 
to  reduce  this  initial  list  considerably.)  Each 
label has associated with it a special "when-proposed"
procedure, which is executed for each
region  whenever  that  label  is  first  proposed  for 
the  region.  This  permits  the  calculation  of  cer- 
tain parameters  that  make  sense  only  if  the  region 
is  interpreted  as  a particular  sort  of  object.  For 
example,  if  a region  is  hypothesized  to  correspond 
to  an  object  of  a certain  intrinsic  size,  then  it 
may  be  useful  to  use  the  region's  apparent  size,  in 
conjunction  with  the  camera  geometry,  to  compute 
the  object's  range  and  location  in  space.  Notice, 
however,  that  this  computation  makes  sense  only 
under  this  hypothesis. 

Next,  the  label  lists  are  filtered  using  what 
are called here "unary constraints". That is,
knowledge about the intrinsic properties of objects
is used to eliminate incorrect labellings from each
region's label list. Regarded as propositions,
the unary constraints have the form: "If a region
is to bear this label, then the region must have
these properties". The constraint is actually used
in the contrapositive form: "If a region does not
have these properties, then it cannot bear this
label."  These  properties  may  be  immediate  proper- 
ties of  the  region,  or  they  may  be  those  computed 
indirectly  by  the  appropriate  when-proposed  pro- 
cedure. 

Clearly,  these  three  steps  (hypothesizing 
labels,  computing  parameters,  and  filtering  labels) 
could  be  done  at  one  swoop,  with  some  improvement 
in  computational  efficiency.  Here  they  are  done 
separately, for clarity of presentation and ease of
programming. 
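
The three steps (hypothesizing labels, computing parameters, and filtering labels) can be sketched as a single pass over the regions. This is a minimal illustration, not the authors' LISP code: `initial_labelling` is an invented name, regions are assumed to be dictionaries of measurements, and the two toy rules for TANK and SKY are made up for the example and are not the TANKSWORLD constraints.

```python
def initial_labelling(regions, labels, when_proposed, unary):
    """Return ({region index: surviving labels}, {(index, label): parameters})."""
    node_labels, params = {}, {}
    for i, region in enumerate(regions):
        survivors = set()
        for label in labels:
            # Hypothesize the label and run its when-proposed procedure, if any.
            hook = when_proposed.get(label)
            derived = hook(region) if hook else {}
            # Contrapositive use of the unary constraint: a region lacking
            # the required properties cannot bear the label.
            test = unary.get(label)
            if test is None or test(region, derived):
                survivors.add(label)
                params[(i, label)] = derived
        node_labels[i] = survivors
    return node_labels, params

# Hypothetical toy rules, for illustration only.
when_proposed = {"TANK": lambda r: {"aspect": r["width"] / r["height"]}}
unary = {
    "TANK": lambda r, p: r["area"] < 500 and p["aspect"] > 1.0,
    "SKY": lambda r, p: r["brightness"] < 50,
}
regions = [
    {"width": 20, "height": 8, "area": 120, "brightness": 200},
    {"width": 100, "height": 40, "area": 3900, "brightness": 20},
]
node_labels, params = initial_labelling(
    regions, ["GROUND", "SKY", "TANK"], when_proposed, unary)
```

A label with no unary constraint (GROUND in this toy setup) survives on every region, which mirrors the behavior of weakly constrained labels discussed later in the paper.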

After the region labels have been filtered, we
can attach to each interconnection (or arc) a list
of label pairs that is the cross-product of the
sets of labels on the two regions at either end of
the arc. This list represents the joint labellings
that are simultaneously possible for the two regions
considered. Then all these label-pair lists can be
filtered by binary constraints, that is, those joint
labellings can be eliminated that violate a
constraint on the labelling of pairs of regions. These
constraints have the propositional form: "If two
regions (say r1 and r2) are to simultaneously bear
the labels l1 and l2 respectively, then r1 and r2
must stand in certain relations to each other."
Again the constraint is used in its contrapositive
form: If the two regions fail to stand in the
required relations to each other, the appropriate
pair of labels can be deleted from the arc joining
them.
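
Arc labelling and binary filtering can be sketched as follows. The code is an illustration under stated assumptions: `label_arcs` is an invented name, binary constraints are assumed to be stored in a dictionary keyed by ordered label pairs, and the single toy rule (GROUND lies lower in the image than SKY, with row indices growing downward) is hypothetical.

```python
from itertools import product

def label_arcs(arcs, node_labels, regions, binary):
    """Attach to each arc the label pairs that survive the binary constraints."""
    arc_labels = {}
    for i, j in arcs:
        pairs = set()
        # Cross-product of the label sets at the two ends of the arc.
        for li, lj in product(node_labels[i], node_labels[j]):
            test = binary.get((li, lj))
            # Contrapositive use: delete the pair if the required relation fails.
            if test is None or test(regions[i], regions[j]):
                pairs.add((li, lj))
        arc_labels[(i, j)] = pairs
    return arc_labels

# Hypothetical toy rule, stated once for each ordering of the pair.
binary = {
    ("GROUND", "SKY"): lambda g, s: g["bottom"] > s["bottom"],
    ("SKY", "GROUND"): lambda s, g: s["bottom"] < g["bottom"],
}
regions = [{"bottom": 10}, {"bottom": 90}]
node_labels = {0: {"SKY", "GROUND"}, 1: {"SKY", "GROUND"}}
arc_labels = label_arcs([(0, 1)], node_labels, regions, binary)
```

Here the pair (GROUND, SKY) is deleted from the arc because region 0 lies above region 1, while the three remaining pairs are retained.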

Following  all  this,  three  more  filtering  pro- 
cesses can  be  applied.  One  of  them,  filtering  by 
existential  constraints,  enforces  constraints  of 
the  following  form:  "If  a region  is  to  bear  a 

certain  label  then  there  must  exist  other  regions 
that  have  certain  properties  and  stand  in  certain 
relationships with the given region." This is
very  much  like  a unary  constraint,  except  that  the 
properties  of  the  other  regions  include  the  require- 
ment that  they  bear  certain  labels,  and  that  those 
labels  are  permitted  simultaneously  with  the 
labelling  to  which  the  existential  constraint  is 
being  applied.  Thus  existential  constraints  must 
examine  the  arc  labellings.  Unary  constraints 
need  be  applied  just  once,  but  since  the  allowable 
labellings  of  arcs  change  during  the  constraint 
processing,  an  existential  constraint  that  is 
satisfied  early  may  later  be  violated  because  a 
labelling  that  it  depended  on  has  been  rejected. 

Hence  the  filtering  by  existential  constraints 
should be redone every time the arc labellings
change. 
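
One pass of existential filtering might look like the sketch below. It assumes, for simplicity, that each constrained label requires a single partner label and a single relation; the name `filter_existential`, the data layout, and the toy rule (a TANK needs a connected GROUND region of larger area) are all hypothetical.

```python
def filter_existential(node_labels, arc_labels, regions, existential):
    """One filtering pass; returns True if any node label was deleted.

    existential maps a label to (partner_label, relation), where
    relation(region, other_region) tests the required relationship.
    """
    changed = False
    for i, labels in node_labels.items():
        for label in sorted(labels):          # sorted() copies, so deletion is safe
            if label not in existential:
                continue
            partner_label, relation = existential[label]
            satisfied = False
            # Look for a connected region that may still bear the partner label
            # jointly with this label, and that stands in the required relation.
            for (a, b), pairs in arc_labels.items():
                if a == i and (label, partner_label) in pairs:
                    satisfied = relation(regions[i], regions[b])
                elif b == i and (partner_label, label) in pairs:
                    satisfied = relation(regions[i], regions[a])
                if satisfied:
                    break
            if not satisfied:
                labels.discard(label)
                changed = True
    return changed

# Hypothetical toy rule: a TANK needs a connected, larger GROUND region.
existential = {"TANK": ("GROUND", lambda t, g: g["area"] > t["area"])}
regions = [{"area": 100}, {"area": 5000}, {"area": 80}]
node_labels = {0: {"TANK"}, 1: {"GROUND"}, 2: {"TANK"}}
arc_labels = {(0, 1): {("TANK", "GROUND")}}
changed = filter_existential(node_labels, arc_labels, regions, existential)
```

Region 2 has no arc to the GROUND region, so it loses its TANK label even though a suitable GROUND region exists in the image. Because the test consults the arc labellings, the pass has to be repeated whenever they change.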

The  other  two  filtering  processes,  arc-upon- 
node  interaction  and  node-upon-arc  interaction, 
attempt  to  enforce  consistency  between  the  node 
labellings  and  arc  labellings.  The  first  process 
ensures  that  every  node  labelling  has  support  from 
every  arc  that  impinges  on  it.  By  "support"  we 
mean  that  there  exists  on  each  arc  at  least  one 
label  pair  that  has  the  same  label  as  the  region 
for  its  first  or  second  component,  as  appropriate, 
depending  on  which  end  of  the  arc  the  node  lies  at. 
If a node labelling lacks support, it is deleted.
The second process ensures that every arc labelling
has support from the nodes at either end of it.
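
The two consistency processes can be sketched as a loop that runs until nothing more is deleted. This is an illustrative sequential version, not the authors' implementation; the name `propagate` and the data layout (a dictionary of label sets per node, and a dictionary of label-pair sets per arc keyed by `(i, j)`) are assumptions.

```python
def propagate(node_labels, arc_labels):
    """Alternate the two processes, in place, until the network stabilizes."""
    changed = True
    while changed:
        changed = False
        # Arc-upon-node: every node label needs support on every impinging arc.
        for (i, j), pairs in arc_labels.items():
            for node, position in ((i, 0), (j, 1)):
                supported = {pair[position] for pair in pairs}
                lost = node_labels[node] - supported
                if lost:
                    node_labels[node] -= lost
                    changed = True
        # Node-upon-arc: every label pair needs both labels on its end nodes.
        for (i, j), pairs in arc_labels.items():
            dead = {p for p in pairs
                    if p[0] not in node_labels[i] or p[1] not in node_labels[j]}
            if dead:
                pairs -= dead
                changed = True
    return node_labels, arc_labels
```

Since every step only deletes labellings, the loop must terminate, and a node that loses a label at one end of a chain of arcs can force deletions several arcs away.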
These three interdependent filtering
processes provide a simple but effective way of
propagating inferences about the identification of
regions through the constraint network. They
provide the system with a rudimentary form of
reasoning about scenes, in the sense that its
conclusions, if justified logically, would take
several proof steps from the given axioms (these are
the region properties and relations and the
domain constraints). Of course, all this reasoning
is done by a mechanical process of propagating the
effects of deleting node and arc labellings, but
can be regarded as a limited form of logical
deduction.

Notice  that  all  the  filtering  processes  work 
strictly  by  refuting  and  eliminating  labellings. 

This means that, after the initial labelling
generation processing, all the filtering processes
could be run independently and asynchronously on
the [...]. That is, the results of the
constraint filtering will be the same, no matter in
what order the individual filtering processes are
applied to each node, provided all processes are
applied until the network stabilizes, when no
further deletion of labellings can be made. However,
in the interests of efficiency and simplicity, and
in order to simulate an actual parallel
implementation, we apply the various processes
synchronously in parallel over the entire network. As
described above, we first perform all the initial
node labellings, next all the node filtering by
[...] labellings using the node labellings.
After the constraint filtering has stabilized
and terminated, we can turn our attention to the
interpretation of its results. Unfortunately,
these results will not necessarily be correct in
the sense of being a valid solution to the given
constraint satisfaction problem. Ideally we
would like to see every region correctly and
uniquely labelled with its identification as an
object or object part. However, given the way we
have decomposed the problem so as to make it
amenable to parallel processing, such an outcome
cannot be guaranteed. Before discussing these
erroneous results in detail, we should stress that
both in the example domain used here, and in other
domains [17,20], we have not found these errors to
be a serious problem in practice. Other authors
[21] report similar findings.

One sort of error is that after the filtering
a region retains several labels [...]
identifications of neighboring regions [...]. From one
point of view, this need not necessarily be
considered an error: a region can have several
different interpretations, and all of these are
retained. But if a number of regions all have
multiple labels, it may be of interest to
discover which unique labellings of all of them are
simultaneously possible, and this sort of
unravelling cannot be done by mere constraint filtering.

[...] labelling: There may be ambiguity that can
be resolved in principle, but cannot be resolved
by pairwise filtering; its resolution requires the
simultaneous examination of [...] nodes.

Errors such as these tend to occur infrequently,
since in most domains it seems that pairwise [...]
is sufficient for essentially unambiguous
interpretation. Even when they occur they can
easily be resolved by some sort of backtracking
technique, alone or in combination with further
constraint filtering [...] [21], and by Barrow and
[...]. In most cases, the [...]. We have not
implemented such a post-processing step for the
current system because our main interest is in the
filtering itself.

It is worth remarking that if only unary and
binary constraint filtering are used, or
existential constraints are used but the network is
sufficiently complete, extra labelling (as discussed
[...]). Under these conditions the constraint
filtering will [...] labellings. This means that if
every region bears a single label, then we can be
sure that all these labellings comprise the unique,
[...] solution to the constraint satisfaction
problem. If any regions bear multiple labels, then
[...] unique interpretations could be found, if
necessary by a later backtracking process.
Unfortunately, if existential constraints are used,
then [...] of the network cannot be [...] unless it
is a complete graph, which is seldom feasible in
practice.

The  other  sort  of  error  that  can  occur  is  that 
a correct labelling is mistakenly rejected. As
implied above, this can only happen when existential
constraints are applied to a network that lacks
some necessary interconnections. Since the
completeness of the interconnections cannot always be
determined ahead of time, it cannot be foreseen
whether such errors will occur. If they do occur,
then they are irreparable, since a label once lost
cannot  conveniently  be  reinstated.  But  once  again, 
these  errors,  while  theoretically  possible,  have 
not  occurred  in  our  examples  because  our  simple 
configuration  rule  mentioned  earlier  ensures  suffi- 
cient interconnection  in  the  network — at  least  for 
the  existential  constraints  used  in  the  example 
domain. 

Related  to  this  is  a problem  that  occurs  if  a 
region  loses  all  of  its  labellings,  that  is,  all 
possible  identifications  of  it  can  be  refuted. 

This  means  that  the  region  is  unrecognizable  as 
anything  from  the  assumed  domain.  But  once  any 
node  in  the  network  is  unrecognizable,  its  effect 
will  be  propagated  until  all  nodes  lose  all  their 
labels.  Strictly  speaking,  this  is  perfectly  cor- 
rect: If  a scene  contains  objects  that  cannot  be 

recognized,  then  the  scene  could  not  possibly  be 
from  our  chosen  domain,  and  therefore  the  whole 
scene  is  essentially  unrecognizable.  The  problem 
is  that  the  constraint  filtering  implicitly  assumes 
that  the  set  of  labels  and  constraints  correctly 
account for everything that might possibly appear.

If  this  assumption  is  violated,  then  the  entire 
image  must  be  rejected,  even  though  the  image  could 
be  successfully  interpreted  if  the  alien  object 
were  not  there.  While  theoretically  justifiable, 
this  behavior  is  undesirable  in  practice.  If  such 
a vision  system  were  turned  loose  on  the  world,  we 
would  not  want  it  to  effectively  go  blind  every 
time  an  unexpected  object  chanced  into  its  field  of 
view.  One  solution  to  this  problem  is  to  postulate 
a catch-all  label  for  which  there  are  no  constraints 
whatsoever.  Any  region  in  the  image  can  therefore 
bear  this  label,  even  those  that  are  otherwise  un- 
recognizable. Of  course,  all  recognizable  regions 
will  also  bear  this  label  in  addition,  and  there 
will  be  numerous  additional  label  pairs  attached  to 
the arcs. This will cause no problem with the
interpretation, but it does introduce a certain
computational overhead which may not be negligible.
Another  solution,  which  does  not  suffer  from  these 
problems,  is  this:  When  a node  loses  all  its  labels, 

it  should  be  marked  as  unrecognizable,  and  then  re- 
moved from  the  network,  with  its  connecting  arcs  as 
well,  so  that  the  undesirable  effects  cannot  spread 
further.  Both  of  these  solutions  have  ramifications 
that  we  shall  not  go  into  here.  Because  of  this, 
and  because  the  problem  only  arises  when  the  model 
embodied  in  the  constraints  is  inadequate,  we  have 
not  made  any  special  provision  for  handling  it  in 
the  prototype  system.  If  any  region  is  found  to  be 
unrecognizable,  we  go  back  and  revise  the  model  to 
account  for  the  misrecognition. 
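
The second solution mentioned above (marking a node as unrecognizable and removing it from the network) was not implemented in the prototype, but it could be sketched roughly as follows. The function name `prune_unrecognizable` and the data layout are hypothetical, and the sketch assumes it is called before the consistency processes have had a chance to spread the empty label list to neighboring nodes.

```python
def prune_unrecognizable(node_labels, arc_labels):
    """Remove nodes with no labels, and their arcs; return the removed nodes."""
    dead = {node for node, labels in node_labels.items() if not labels}
    for node in dead:
        del node_labels[node]
    # Arcs touching a removed node go too, so that the empty label list
    # cannot withdraw support from the labels of neighboring nodes.
    for arc in [a for a in arc_labels if a[0] in dead or a[1] in dead]:
        del arc_labels[arc]
    return dead
```

The returned set identifies the regions to be reported as unrecognizable; the rest of the network can then be filtered as usual.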

This brings us to one final matter: How are
the constraint models for a particular domain
constructed in the first place? Winston [22] has
proposed an automatic system for building scene models
that uses inductive inference over a set of training
examples. Such an approach is certainly possible
here, but the problem of separating relevant from
irrelevant features can be expected to be very
difficult for all but the simplest scenes. For now,
we expect that models will be built by hand. A user
of the system relies on his own introspection and
knowledge of the domain to construct an initial
model, applies it to some well-chosen examples,
diagnoses any errors, and then corrects the model
accordingly. For many applications, this is quite
an acceptable way of building models.

In  order  to  illustrate  the  matters  presented 
above,  we  now  give  examples  of  the  operation  of  our 
prototype  system  in  a particular  domain. 

4.  An example domain: TANKSWORLD

In  order  to  test  out  the  ideas  described  in  the 
previous  sections,  we  have  implemented  a prototype 
system  for  scene  analysis  using  constraint  filter- 
ing, and  applied  it  to  a domain  of  forward-looking 
infra-red  (FLIR)  images  of  tanks  and  other  military 
vehicles  on  fairly  open  ground.  The  image  segmen- 
tation and  region  extraction  programs  were  written 
in  the  programming  language  C.  The  constraint  fil- 
tering system  was  written  in  LISP.  This  includes 
the  constraint  filtering  procedures  themselves,  and 
also  a number  of  auxiliary  procedures,  including 
those  that  provide  the  relational  primitives  out  of 
which  the  constraints  were  constructed.  The  con- 
straints were  written  as  logical  expressions  in 
these  primitives,  using  special  conventions  to  mark 
the  variables  for  the  regions. 

Example  images  from  this  chosen  domain  (dubbed 
"TANKSWORLD")  can  be  seen  in  Figure  3.  We  admit 
five  principal  region  labels  in  TANKSWORLD: 


GROUND  (corresponding to the ground or any patch of ground)
SKY     (the sky or any patch of sky)
SMOKE   (a puff of smoke or similar bright compact object)
TANK    (a tank, or any vehicle)
TREE    (a tree or shrub)

Only  TREE  and  TANK  have  any  real  size  restriction, 
so only for these is oversegmentation a problem.
Therefore  we  provide  two  additional  labels,  TANK- 
FRAGMENT  and  TREE-FRAGMENT,  to  cover  pieces  of 
these  objects. 

In  general,  the  spatial  position  of  an  object 
cannot be determined from a single image. All that
can  be  said  is  that  the  object  must  lie  along  a 
certain  line  of  sight.  However,  we  know  that  TANKS 
and  TREES  (we  use  the  labels  here  informally  to 
stand  for  the  classes  of  objects  they  represent) 
must  stand  on  the  ground,  and  if  we  assume  that  the 
ground  is  an  approximately  level  plane,  we  can  use 
projective  geometry  and  a simple  camera  model  in 
order  to  fix  their  actual  spatial  locations,  and 
from  this  determine  their  ranges,  and  their  actual 
sizes  from  their  apparent  sizes  in  the  image.  The 
simple  camera  model  used  for  these  computations  is 
shown  in  Figure  4.  It  assumes  that  the  image  is 
formed  by  a simple  pin-hole  camera,  with  known 
parameters.  For  objects  in  space  we  use  Cartesian 
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coordinates  x,v,z,  with  the  origin  on  the  ground 
vertically  bel  w the  camera's  pinhple.  The  x axis 
runs  along  the  ground  to  the  right  from  this  origin, 
the  y axis  directly  forward,  and  the  z axis  verti- 
cally. In  the  image  we  use  coordinates  d (horizon- 
tal) and  n (vertical)  relative  to  an  origin  at  the 
-.enter  of  the  field  of  view.  These  coordinates  are 
related  by 


\[
\frac{d}{f} = \frac{x}{y\cos\phi - (z-h)\sin\phi}
\]

\[
\frac{n}{f} = \frac{y\sin\phi + (z-h)\cos\phi}{y\cos\phi - (z-h)\sin\phi}
\]

where h is the height of the pinhole above the
ground, f is the distance from the pinhole to the
film plane, and $\phi$ is the dip angle below the
horizontal of the optical axis of the camera. These
equations give an adequate approximation for any
camera, provided the field of view is not too wide.
In practice, these parameters should be known. For
the images used here they were not known, but were
estimated by taking measurements on the images of
objects whose size was approximately known.
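
As an illustration of how the camera model can be used to recover
range and size, the following is a minimal sketch, not the
original implementation. It assumes the pin-hole relations given
above, a level ground plane at z = 0, and an object whose top lies
directly above its base. The function names (project, ground_point,
height_at) are hypothetical and were chosen for this example only.

```python
import math


def project(x, y, z, f, h, phi):
    """Map a scene point (x, y, z) to image coordinates (d, n).

    f: pinhole-to-film distance, h: pinhole height above the ground,
    phi: dip angle of the optical axis below the horizontal (radians).
    """
    depth = y * math.cos(phi) - (z - h) * math.sin(phi)
    if depth <= 0:
        raise ValueError("point is not in front of the camera")
    d = f * x / depth
    n = f * (y * math.sin(phi) + (z - h) * math.cos(phi)) / depth
    return d, n


def ground_point(d, n, f, h, phi):
    """Invert the projection for a point assumed to lie on the ground (z = 0).

    Returns (x, y); y is the range along the ground.
    """
    denom = f * math.sin(phi) - n * math.cos(phi)
    if denom <= 0:
        raise ValueError("image point is on or above the horizon")
    y = h * (f * math.cos(phi) + n * math.sin(phi)) / denom
    depth = y * math.cos(phi) + h * math.sin(phi)
    x = d * depth / f
    return x, y


def height_at(n_top, y, f, h, phi):
    """Height z of a point seen at image row n_top, assumed to stand at range y."""
    num = n_top * math.cos(phi) - f * math.sin(phi)
    den = f * math.cos(phi) + n_top * math.sin(phi)
    return h + y * num / den
```

Applying ground_point to the bottom of a region's box gives the
range of the object, and height_at applied to the top of the box
then gives its actual height. Setting n = f tan(phi) in these
relations corresponds to the horizon, where the range becomes
unbounded.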


So for the labels TANK and TREE, we have a
when-proposed function that computes spatial
location on the ground and approximate vertical and
horizontal extent. For the corresponding fragments
the when-proposed function can compute only bounds
on these values, but these bounds are nonetheless
useful.

There are 62 constraints used in the current
model. The unary constraints are used to enforce
the size restrictions on TANKS, TREES and their
fragments, limits on the height to width ratio for
TANKS and TREES, and restrictions on the position
of SKY and GROUND relative to the horizon. The
binary constraints are used in two ways. First, to
express that the region for a compact object such as
TANK, TREE, SMOKE cannot surround the region for
any other sort of object (except that TANKS and
TREES can surround their respective fragments).
Second, to enforce restrictions on the relative
brightness of objects: that SMOKE is brighter than
anything else, TANK is not brighter than anything
else, and that objects of the same class have
roughly the same brightness, except for GROUND,
which has considerable variation. (Notice that the
system is given no knowledge of the absolute
brightnesses of objects; only relative brightness is
used. This was done deliberately in order to
demonstrate the ability of the constraint
filtering.) Finally, the existential constraints
capture the requirement that TREES and TANKS must
rest upon a piece of GROUND, and that a fragment of
an object must have next to it another fragment of
the same sort such that the two taken together do
not exceed the size restrictions for the
corresponding whole object.

In Figure 5, we show some typical subimages
from this domain. Figure 6 shows these images after
segmentation with boxes drawn around the regions.
Because of memory limitations of the present
implementation, regions below a certain size were
ignored, and interconnections were made only between
regions whose boxes were immediately adjacent or
overlapping, and to the one or two largest regions
in each image. The constraint-filtering system was
run on these examples, and the results are presented
in Figures 7 and 8. (For each label, unambiguously
labelled regions are shown in white; ambiguously
labelled regions, which bear other labels as well,
are shown in gray.) In all cases the constraint
filtering stabilized after only a few iterations of
the propagation processes.

As can be seen, the results are quite good,
especially considering the noisiness of the original
image, and the blind simplification done by the
segmentation and the region extraction. A number
of problems with the results are worth discussing,
since they illustrate limitations of this approach.

Since there are so few constraints on GROUND,
many other sorts of objects will retain this label.
In a sense, this is perfectly unobjectionable. In
these images there is no way of distinguishing a
tank from a patch of ground with the same shape
and coloration as a tank. In this domain the TANK
interpretation is more likely, but the constraint
filtering has no mechanism for expressing
preference.


Another problem is that because there is often
little contrast between sky and ground, quite a
number of regions straddle the horizon, and thus
admit both the labels SKY and GROUND. It is clear
that the segmentation is wrong, but the current
system can only accept uncritically the regions it
receives from the segmentation. A more
sophisticated system could attempt to modify the
segmentation when such a contradiction was detected.
In a few cases, an object clear to the eye is merged
with another object because of a short segment of
low contrast boundary between them and is thus lost
altogether. Detecting and repairing such a mistake
in segmentation is really quite difficult.


The limited context provided by the limited
interconnection of regions causes some difficulties.
There are a few regions that retain the label SMOKE,
not because they are the brightest regions in the
image, but merely because they are brighter than
anything they are connected to. In some other
images it happens that a cluster of TREE-FRAGMENTS
support each other, even though altogether they are
too large or too small to comprise an entire TREE.
The system takes into account only the pairwise
interactions of the fragments, without trying to
combine them into a coherent whole.
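
The effect of limited interconnection on the SMOKE label can be
reproduced in a small sketch. This is not the system described
here: it is a generic pairwise label-filtering loop, reduced to two
labels and to the relative-brightness constraints only. The function
names, the brightness values and the tolerance are assumptions made
for illustration.

```python
def compatible(la, ba, lb, bb, tol):
    """Can label la (brightness ba) coexist with label lb (brightness bb)?"""
    if la == "SMOKE" and lb == "SMOKE":
        return abs(ba - bb) <= tol   # same class: roughly the same brightness
    if la == "SMOKE":
        return ba > bb               # SMOKE is brighter than anything else
    if lb == "SMOKE":
        return bb > ba
    return True                      # no brightness constraint otherwise


def filter_labels(labels, arcs, brightness, tol=20):
    """Discard labels that no label on a connected region supports.

    labels: region -> set of candidate labels (left unmodified)
    arcs:   list of (region, region) pairs that are interconnected
    """
    labels = {r: set(ls) for r, ls in labels.items()}
    changed = True
    while changed:                   # repeat until the labelling is stable
        changed = False
        for a, b in arcs:
            for src, dst in ((a, b), (b, a)):
                for la in list(labels[src]):
                    supported = any(
                        compatible(la, brightness[src], lb, brightness[dst], tol)
                        for lb in labels[dst]
                    )
                    if not supported:
                        labels[src].discard(la)
                        changed = True
    return labels


brightness = {"r1": 200, "r2": 100, "r3": 150}
start = {r: {"SMOKE", "GROUND"} for r in brightness}
limited = filter_labels(start, [("r1", "r2"), ("r2", "r3")], brightness)
full = filter_labels(start, [("r1", "r2"), ("r2", "r3"), ("r1", "r3")], brightness)
```

With only the arcs r1-r2 and r2-r3, region r3 keeps SMOKE because it
is brighter than the one region it is connected to. Adding the arc
r1-r3 removes that label, since r1 is brighter still.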


There are some other misidentifications that
can be blamed on the simplified shape description
used here. A number of odd-shaped regions are
labelled as TREES just because the regions happen
to fit boxes of about the right size and shape,
even though it is apparent that they look nothing
like TREES in their actual shape.


Despite  these  problems,  it  is  clear  that  the 
constraint  filtering  can  accomplish  almost  all  the 
task of analyzing these scenes. In the next
section, we will discuss some of the issues raised by
these problems, and consider ways of extending the
constraint filtering process to overcome them.

5.  Discussion 


We  have  seen  in  the  previous  section  that  con- 
straint filtering  is  a feasible  technique  for  scene 
analysis,  even  if  implemented  in  a very  simple  way. 
We  will  discuss  here  some  extensions  to  this  tech- 
nique that  would  overcome  most  of  the  shortcomings 
of  the  current  approach,  and  lead  to  a more  powerful 
and  flexible  scene  analysis  system. 

One  straightforward  improvement  would  be  to 
provide  a more  accurate  shape  description  for  re- 
gions, which  would  permit  more  realistic  computation 
of  region  properties  and  relations.  The  represen- 
tation of  regions  by  boxes  is  convenient,  but  hardly 
satisfactory.  In  a few  cases,  a spurious  adjacent 
or surround relation will hold between the boxes of
two  regions,  when  in  fact  it  is  not  true  of  the 
regions  themselves.  This  can  lead  to  errors  in 
interpretation. 

It  would  be  desirable  to  provide  some  facility 
for  indicating  preference  between  several  labels  for 
a region,  all  logically  equally  possible,  but  one 
far  more  likely.  The  unavoidable  labelling  of 
TANKS  also  as  GROUND,  mentioned  in  the  previous 
section,  illustrates  this  problem.  More  generally, 
as  suggested  by  numerous  authors  [14,21,23],  it 
would  be  useful  to  attach  probabilities  or  confi- 
dence measures  to  all  the  hypotheses,  properties 
and  relations  in  the  system,  and  provide  a calculus 
for  combining  these  confidence  measures.  As  an 
example,  the  relation  same-brightness  just  checks 
that  the  difference  in  brightness  between  two 
regions  is  below  some  given  threshold.  In  most 
domains  this  is  unsatisfactory.  There  is  no  sharp 
cut-off  between  "same"  and  "not  the  same".  All  we 
can  say  is  that  the  greater  the  difference  in 
brightnesses  between  two  regions,  the  less  reason- 
able it  is  to  regard  them  as  having  the  same 
brightness. 
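
The contrast between a sharp threshold and a graded measure can be
sketched as follows. The Gaussian fall-off and the scale value are
assumptions chosen for illustration only; no particular confidence
function or calculus is specified in the text.

```python
import math


def same_brightness(b1, b2, threshold=20.0):
    """Sharp test: true only if the difference is within the threshold."""
    return abs(b1 - b2) <= threshold


def same_brightness_confidence(b1, b2, scale=20.0):
    """Graded test: 1.0 for equal brightness, falling smoothly toward 0.

    The Gaussian shape is a hypothetical choice; any function that
    decreases with the brightness difference would serve.
    """
    return math.exp(-((b1 - b2) / scale) ** 2)
```

Two regions differing by 19 and by 21 units fall on opposite sides of
the sharp test, while the graded measure assigns them nearly equal
confidence.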

Related  to  this  is  the  need  for  a more  subtle 
combination  of  evidence.  A certain  label  may  have 
a number  of  constraints  applicable  to  it.  If  a 
certain  region  passes  all  but  one  of  these  con- 
straints it  would  lose  that  label.  But  in  some 
circumstances  it  may  be  more  reasonable  to  suspect 
that  the  labelling  is  correct  but  that  some  error 
has  been  made  in  the  evaluation  of  the  failed  con- 
straint. Perhaps  an  important  piece  of  evidence 
was  obliterated  by  noise,  occlusion,  or  poor  seg- 
mentation. Ideally  a scene  analysis  system  should 
be  able  to  tolerate  such  lost  evidence,  and  even 
attempt  to  recover  it  by  a closer  re-examination 
of  the  original  image. 

This brings us to the matter of the
interaction between the scene analysis system and the
image data. In the current system there is a
strictly one-way flow: segmentation, then analysis
of the segmentation. It would be preferable to
have a mechanism whereby the higher-level analysis
could, under certain conditions, call for a
re-examination of parts of the original image in
order to search for features that may have been
lost in the initial processing. The scene analysis
system should also be provided with an arsenal of
assorted image processing routines, in addition to
segmentation, in order to capture lines, spots,
and other features that are likely to be lost
during segmentation.


The  current  scheme  for  interconnecting  regions 
into  a network,  while  effective,  is  fairly  ad  hoc. 
It should be possible, by an analysis of the
constraints, to make a more rational decision about
the  connection  of  regions.  A region  may  need  only 
to  be  connected  to  regions  that  stand  in  certain 
relations  to  it  and  that  retain  certain  labels 
after  the  unary  constraint  filtering,  for  these  are 
the  only  regions  that  could  possibly  falsify  the 
applicable  constraints.  More  generally,  it  may  be 
advantageous  to  permit  the  reconfiguration  of  the 
network  during  processing,  although  this  would  have 
to  be  done  with  great  care  in  order  to  retain  the 
desirable properties of constraint filtering.

[...] clumsy notation for expressing constraints that
apply  to  a number  of  labels.  As  mentioned  earlier, 
to express the notion that SMOKE is the brightest
object in the domain we must provide a separate
brighter-than constraint for every other label in
the domain. This could be overcome, at some cost
in efficiency, by permitting some sort of
quantification over labels, allowing constraints like
"For all labels i, not equal to SMOKE, SMOKE is
brighter than i." But this is only a cosmetic
change,  which  does  not  address  the  underlying 
problem.  The  current  system  regards  all  the  labels 
as  being  quite  independent  of  each  other.  This 
becomes  quite  a serious  computational  inefficiency 
as  the  number  of  labels  becomes  large  (as  it  will 
for  any  realistic  domain),  especially  as  the  amount 
of  calculation  on  each  arc  is  roughly  proportional 
to  the  square  of  the  number  of  labels  in  the  domain. 
But  in  reality,  the  labels  in  a particular  domain 
will  usually  show  certain  similarities  among  them- 
selves and  share  many  constraints.  It  is  wasteful 
to  independently  re-evaluate  for  each  label  these 
shared  constraints.  This  inefficiency  can  be 
naturally and effectively overcome by organizing
the  labels  of  a domain  into  a taxonomy  based  on 
similarity  and  shared  constraints.  The  system 
could  initially  propose  generic  labels,  which 
stand  for  whole  classes  of  objects,  and  test  these 
labels  by  applying  only  those  constraints  common 
to whole classes of objects. Later, when no
further progress could be made by such general
reasoning, the generic labels could be replaced by
more specialized labels, and more specialized
constraints could be applied. For example, in
TANKSWORLD, we could group all objects that must lie
below  the  horizon  into  a single  class,  and  elimi- 
nate this  class  label  from  all  regions  that  lie 
above  the  horizon.  Once  this  had  been  done  we 
could  specialize  objects  below  the  horizon  into 
classes  of  compact  and  extended  objects.  Later 
the  compact  objects  could  be  subdivided  into  TANKS 
and  TREES.  If  necessary,  TANKS  and  TREES  could  be 
further  classified  into  their  different  models  and 
varieties. While this scheme is intuitively clear,
some work still remains to be done in order to
properly formalize it, especially in regard to the
interactions between nodes at different levels of
specialization.
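
The taxonomy proposed above might be sketched as follows for unary
constraints only. This is an illustration under stated assumptions,
not a formalization: the generic label names (BELOW-HORIZON, COMPACT,
EXTENDED), the numeric limits and the function names are all
hypothetical, and the interactions between nodes at different levels
of specialization are not addressed.

```python
# Hypothetical taxonomy: each generic label lists its specializations.
CHILDREN = {
    "BELOW-HORIZON": ["COMPACT", "EXTENDED"],
    "COMPACT": ["TANK", "TREE"],
    "EXTENDED": ["GROUND"],
}
ROOTS = ["BELOW-HORIZON", "SKY", "SMOKE"]

# Each constraint is attached at the most general label it applies to,
# so it is evaluated once per region rather than once per leaf label.
# The numeric limits are invented for this example.
CONSTRAINTS = {
    "BELOW-HORIZON": lambda r: not r["above_horizon"],
    "SKY": lambda r: r["above_horizon"],
    "COMPACT": lambda r: max(r["width"], r["height"]) <= 10.0,
    "TANK": lambda r: r["height"] / r["width"] <= 1.0,
    "TREE": lambda r: r["height"] / r["width"] >= 1.0,
}


def passes(label, region):
    """True if the label has no constraint of its own or the region satisfies it."""
    test = CONSTRAINTS.get(label)
    return test is None or test(region)


def specialize(region):
    """Propose generic labels first, then refine the survivors to leaf labels."""
    frontier = [label for label in ROOTS if passes(label, region)]
    leaves = []
    while frontier:
        label = frontier.pop()
        kids = CHILDREN.get(label)
        if kids is None:
            leaves.append(label)       # fully specialized label
        else:
            frontier.extend(k for k in kids if passes(k, region))
    return sorted(leaves)
```

A region above the horizon fails BELOW-HORIZON once, and TANK, TREE
and GROUND are then never proposed for it, which is the saving the
taxonomy is meant to provide.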

The  current  system  also  suffers  from  a one- 
level  treatment  of  network  nodes.  It  is  possible 
for  a cluster  of  regions  to  retain  the  label  TREE- 
FRAGMENT,  even  though  the  cluster,  considered  as  a 
unit,  looks  nothing  like  a TREE.  What  is  needed  is 
a mechanism  for  creating  new  nodes  having  existing 
nodes  as  parts.  This  becomes  more  acutely  neces- 
sary in  more  complex  domains  whose  objects  may  be 
built  up  from  distinct  parts.  Even  in  TANKSWORLD, 
at  slightly  better  resolution,  a TREE  would  be  seen 
to  consist  of  a trunk,  branches  and  foliage;  and  a 
TANK  would  show  wheels,  turret,  gun-barrel  and 
other  details.  Such  techniques  for  hierarchical 
constraint  filtering  have  been  studied  in  simpler 
domains  [24,25],  but  require  further  development 
for  more  complex  domains,  especially  if  they  are 
to  be  applied  in  an  efficient  manner. 

Recently,  Davis  [26]  has  shown  that  constraint 
filtering,  expressed  formally  in  logic,  can  be  re- 
garded as  a limited  form  of  inferencing.  This 
raises  the  possibility  that  more  powerful  forms  of 
constraint filtering could be devised. Currently,
these techniques work by falsifying simple
hypotheses about the individual and joint
identifications of nodes in the network. More
powerful techniques could conceivably reason about
other properties and relations between nodes, for
example, occlusion relations between objects. Formal logic
and  theorem-proving  would  also  provide  a convenient 
means  of  treating  some  of  the  other  extensions  of 
constraint  filtering  described  above. 

In  conclusion,  we  have  shown  that  constraint 
filtering  is  an  effective  means  of  scene  analysis 
in  a domain  more  complex  than  has  previously  been 
used  with  such  techniques.  The  deficiencies  of  the 
approach,  as  revealed  by  the  results  we  have  ob- 
tained, have  suggested  a number  of  improvements 
and  extensions  to  constraint  filtering.  The  devel- 
opment of  these  extensions  is  the  object  of  current 
research.

Figure 2. Circumscribing upright rectangle ("box")
used to describe a region